Introduction¶
Welcome to our brief exploration of machine learning topics through one of the most widely used and readily available sources of data: the stock market!
The Plan¶
The plan is to:
- Explore a single company's stock data via visuals, a selection of metrics, and some cursory statistical analysis.
- Explore connections between the stock trends of similar corporations in the tech industry.
- Consider a question of mine: is forecasting truly improved, and by how much, if we use similar companies' past data to predict a target company's future stock prices?
Terminology and Tech Stack¶
ARIMA models:
- ARIMA, which stands for AutoRegressive Integrated Moving Average, is a class of models that explains a given time series based on its own past values, that is, its own lags and the lagged forecast errors.
adfuller from statsmodels.tsa.stattools:
- The Augmented Dickey-Fuller test (adfuller) is a type of statistical test called a unit root test. It's used to determine the presence of a unit root in a time series sample, which can help to understand whether the time series is stationary or not.
autocorrelation:
- Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay. In time series data, it's used to determine if a data set or time series is random or if there are underlying patterns.
ARIMA model:
- The ARIMA model, or AutoRegressive Integrated Moving Average model, is a popular time series forecasting model that combines the ideas of autoregression (AR) and moving averages (MA). It aims to describe the autocorrelations in the data.
plot_acf and plot_pacf from statsmodels.graphics.tsaplots:
- plot_acf and plot_pacf are functions from the statsmodels library used to plot the autocorrelation function (ACF) and partial autocorrelation function (PACF) of a time series, respectively. These plots are useful for determining the order of AR and MA terms in an ARIMA model.
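To make the autocorrelation idea concrete before we touch real stock data, here is a small sketch on synthetic series (names like `lag_autocorr` are my own, not from any library): white noise has essentially no autocorrelation, while a random walk, the classic non-stationary series, is strongly correlated with its own past.

```python
import numpy as np

def lag_autocorr(x, lag=1):
    """Sample autocorrelation of x at the given lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

rng = np.random.default_rng(0)
noise = rng.standard_normal(500)  # white noise: autocorrelation near 0
walk = np.cumsum(noise)           # random walk: autocorrelation near 1

print(lag_autocorr(noise))
print(lag_autocorr(walk))
```

This is exactly the contrast the ADF test and the ACF/PACF plots below formalize.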
#!pip install yfinance
Apples, Oranges and Data Exploration¶
Now that we have downloaded the necessary packages let's start investigating some data for specific companies.
The Apple¶
Since I am quite the Apple fan, I am going to start off by checking how Apple has been doing over the years.
We are going to import the yfinance package, look up Apple's ticker symbol (the identifier by which its entries are referenced), and extract a "sub-dataframe" of stock prices using that symbol.
import yfinance as yf
# Define the ticker symbol
tickerSymbol = 'AAPL'
# Get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
# Get the historical prices for this ticker
tickerDf = tickerData.history(period='5y')
# See the data
tickerDf
| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 2018-08-06 00:00:00-04:00 | 49.695115 | 49.993764 | 49.472922 | 49.950760 | 101701600 | 0.0000 | 0.0 |
| 2018-08-07 00:00:00-04:00 | 50.010491 | 50.053495 | 49.398856 | 49.482479 | 102349600 | 0.0000 | 0.0 |
| 2018-08-08 00:00:00-04:00 | 49.229228 | 49.649724 | 48.863683 | 49.515930 | 90102000 | 0.0000 | 0.0 |
| 2018-08-09 00:00:00-04:00 | 50.060665 | 50.120394 | 49.503983 | 49.905369 | 93970400 | 0.0000 | 0.0 |
| 2018-08-10 00:00:00-04:00 | 49.715952 | 50.133130 | 49.550519 | 49.756710 | 98444800 | 0.1825 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2023-07-31 00:00:00-04:00 | 196.059998 | 196.490005 | 195.259995 | 196.449997 | 38824100 | 0.0000 | 0.0 |
| 2023-08-01 00:00:00-04:00 | 196.240005 | 196.729996 | 195.279999 | 195.610001 | 35175100 | 0.0000 | 0.0 |
| 2023-08-02 00:00:00-04:00 | 195.039993 | 195.179993 | 191.850006 | 192.580002 | 50389300 | 0.0000 | 0.0 |
| 2023-08-03 00:00:00-04:00 | 191.570007 | 192.369995 | 190.690002 | 191.169998 | 61235200 | 0.0000 | 0.0 |
| 2023-08-04 00:00:00-04:00 | 185.520004 | 187.380005 | 181.919998 | 181.990005 | 115799700 | 0.0000 | 0.0 |
1258 rows × 7 columns
If one's eyes scan across big intervals of years, one may quickly conclude that Apple's stock price has been climbing. What is harder to tell from the raw table is how big the ups and downs in price have been. Even a simple plot makes these qualities much easier to grasp, so let's get one right now. We set up a figure size reasonable for our notebook and add labels (all using matplotlib) to get a modest but useful visual. Note that I have chosen to plot closing prices against date. I would likely see similar findings with opening prices, as the gap between the two varies far less than either metric does over long stretches of time. It may be of interest to the reader to explore this topic on their own!
import matplotlib.pyplot as plt
# Plot the closing price
plt.figure(figsize=(10, 6))
plt.plot(tickerDf['Close'])
plt.title('Closing price of AAPL')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.grid(True)
plt.show()
#!pip install statsmodels
from statsmodels.tsa.stattools import adfuller
# Perform ADF test on the 'Close' prices
result = adfuller(tickerDf['Close'])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
print('\t%s: %.3f' % (key, value))
ADF Statistic: -0.647774
p-value: 0.859818
Critical Values:
	1%: -3.436
	5%: -2.864
	10%: -2.568
# Take the first difference of the closing prices
tickerDf['Close_diff'] = tickerDf['Close'].diff()
# Drop the missing values that were created by taking the difference
tickerDf = tickerDf.dropna()
# Perform ADF test on the differenced data
result = adfuller(tickerDf['Close_diff'])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
print('\t%s: %.3f' % (key, value))
ADF Statistic: -37.111304
p-value: 0.000000
Critical Values:
	1%: -3.436
	5%: -2.864
	10%: -2.568
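These two outputs line up with the usual decision rule: for the raw prices the p-value (≈ 0.86) is far above 0.05, so we fail to reject the unit-root null and treat the series as non-stationary; for the differenced series the p-value is effectively zero, so we reject the null. A tiny helper (a sketch of mine, with the conventional 0.05 threshold as a default rather than the only valid choice) makes the rule explicit:

```python
def interpret_adf(adf_statistic, p_value, alpha=0.05):
    """Translate an ADF test result into a plain-language stationarity verdict."""
    if p_value < alpha:
        return "likely stationary (reject unit-root null at alpha=%.2f)" % alpha
    return "likely non-stationary (fail to reject unit-root null)"

# Values reported above for AAPL closing prices and their first difference
print(interpret_adf(-0.647774, 0.859818))
print(interpret_adf(-37.111304, 0.000000))
```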
Correlation Discussion Followed By Our Analyses¶
When you use plot_acf and plot_pacf from statsmodels.graphics.tsaplots, the resulting plots provide a visual representation of the autocorrelation and partial autocorrelation of a time series, respectively. Here's how you can interpret these plots:
Plotting the Autocorrelation Function (ACF) with plot_acf:
- Stationarity: If the bars (correlogram) in the ACF plot drop off quickly, the series might be stationary. If they remain significant over several lags, the series is likely non-stationary.
- Seasonality: Regular patterns of peaks and troughs at consistent intervals can indicate seasonality. For instance, a peak every 12 lags in a monthly series suggests annual seasonality.
- Randomness: If most bars are within the blue shaded region (confidence intervals) and are close to zero, the series might be random.
- Model Identification: If the ACF cuts off after a certain number of lags, it suggests a possible MA order for an ARIMA model. If it declines gradually, an AR model might be more appropriate.
Plotting the Partial Autocorrelation Function (PACF) with plot_pacf:
- Model Identification: The PACF plot can help identify the order of the AR term. If the PACF drops to zero after a certain number of lags, it suggests that an AR model of that order might be suitable. For instance, if the PACF is significant for 2 lags and then drops off, an AR(2) model might be a good fit.
- Lagged Relationships: Like the ACF, significant spikes in the PACF at specific lags indicate a relationship between the data point and its lagged values, after controlling for the intervening lags.
Here's a simple example using Python to visualize the ACF and PACF; we will run it in the next cell.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Generate some example data (e.g., from a DataFrame)
# data = pd.read_csv('your_data.csv')
# ts = data['your_column']
# For demonstration purposes, let's use random data
ts = np.random.randn(100)
# Plot ACF
plot_acf(ts, lags=40)
plt.title('ACF Plot')
plt.show()
# Plot PACF
plot_pacf(ts, lags=40)
plt.title('PACF Plot')
plt.show()
When interpreting these plots, it's essential to consider the blue shaded region, which represents the confidence intervals. Bars that extend beyond this region are considered statistically significant.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Plot the ACF
plot_acf(tickerDf['Close_diff'], lags=50)
plt.show()
# Plot the PACF
plot_pacf(tickerDf['Close_diff'], lags=50)
plt.show()
ARIMA¶
We are now ready to fit an ARIMA model to Apple's closing prices. Since a single difference was enough to make the series stationary, an ARIMA(1,1,1) model (one AR term, one order of differencing, one MA term) is a reasonable starting point.
from statsmodels.tsa.arima.model import ARIMA
# Fit an ARIMA(1,1,1) model
model = ARIMA(tickerDf['Close'], order=(1,1,1))
model_fit = model.fit()
# Print out the summary of the model
print(model_fit.summary())
SARIMAX Results
==============================================================================
Dep. Variable: Close No. Observations: 1257
Model: ARIMA(1, 1, 1) Log Likelihood -2834.265
Date: Sun, 06 Aug 2023 AIC 5674.530
Time: 14:51:30 BIC 5689.937
Sample: 0 HQIC 5680.320
- 1257
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.4184 0.288 1.453 0.146 -0.146 0.983
ma.L1 -0.4739 0.282 -1.683 0.092 -1.026 0.078
sigma2 5.3400 0.138 38.709 0.000 5.070 5.610
===================================================================================
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 455.93
Prob(Q): 0.99 Prob(JB): 0.00
Heteroskedasticity (H): 5.07 Skew: -0.08
Prob(H) (two-sided): 0.00 Kurtosis: 5.95
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
# Fetch the historical stock data for Microsoft (MSFT)
msft = yf.Ticker('MSFT')
# Get the historical prices for this ticker
msftDf = msft.history(interval='1d', start='2020-1-1', end='2023-1-1')
# Print the first few rows of the data
msftDf.head()
| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 2020-01-02 00:00:00-05:00 | 153.641607 | 155.528499 | 153.206173 | 155.422058 | 22622100 | 0.0 | 0.0 |
| 2020-01-03 00:00:00-05:00 | 153.196506 | 154.773747 | 152.944911 | 153.486786 | 21116200 | 0.0 | 0.0 |
| 2020-01-06 00:00:00-05:00 | 151.996623 | 153.951256 | 151.445062 | 153.883514 | 20813700 | 0.0 | 0.0 |
| 2020-01-07 00:00:00-05:00 | 154.164134 | 154.502799 | 152.228858 | 152.480438 | 21634100 | 0.0 | 0.0 |
| 2020-01-08 00:00:00-05:00 | 153.786716 | 155.596209 | 152.838435 | 154.909180 | 27746500 | 0.0 | 0.0 |
The Apple and the Orange Meet¶
# Import the necessary library
import matplotlib.pyplot as plt
# Plot the closing prices for Apple and Microsoft
plt.figure(figsize=(14, 7))
plt.plot(tickerDf['Close'], label='Apple')
plt.plot(msftDf['Close'], label='Microsoft')
plt.title('Apple vs Microsoft - Closing Prices')
plt.xlabel('Date')
plt.ylabel('Closing Price (USD)')
plt.legend()
plt.show()
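Beyond eyeballing the two curves, one simple way to quantify the "connection" between two stocks is the correlation of their daily returns (percentage changes), which sidesteps the shared upward trend in the raw prices. Below is a self-contained sketch on synthetic data driven by a shared market factor (the helper name `return_correlation` is mine); in the notebook you would pass tickerDf['Close'] and msftDf['Close'], aligned on their common dates.

```python
import numpy as np
import pandas as pd

def return_correlation(close_a, close_b):
    """Pearson correlation of daily percentage returns for two price series."""
    returns = pd.DataFrame({"a": close_a, "b": close_b}).pct_change().dropna()
    return float(returns["a"].corr(returns["b"]))

# Synthetic demo: two price series moved by a common factor plus own noise
rng = np.random.default_rng(7)
market = rng.standard_normal(250)
a = pd.Series(100 * np.cumprod(1 + 0.01 * (market + 0.5 * rng.standard_normal(250))))
b = pd.Series(100 * np.cumprod(1 + 0.01 * (market + 0.5 * rng.standard_normal(250))))
print(return_correlation(a, b))  # well above 0: the shared factor shows up
```

A high return correlation between, say, Apple and Microsoft is one piece of evidence that one company's history might carry information useful for forecasting the other, the question posed in the plan.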
Future Directions¶
This section will continue to develop, so please revisit, and contact me if you have any ideas or would like me to dive more deeply into a question you have. Thank you for visiting!
Other Large and Silently Growing Companies¶
# Fetch the historical stock data for The Home Depot (HD)
hd = yf.Ticker('HD')
# Get the historical prices for this ticker
hdDf = hd.history(interval='1d', start='2020-1-1', end='2023-1-1')
# Print the first few rows of the data
hdDf.head()
| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 2020-01-02 00:00:00-05:00 | 201.584009 | 202.209696 | 200.443032 | 202.117691 | 3935700 | 0.0 | 0.0 |
| 2020-01-03 00:00:00-05:00 | 199.798942 | 202.136088 | 199.440088 | 201.445984 | 3423200 | 0.0 | 0.0 |
| 2020-01-06 00:00:00-05:00 | 199.200855 | 202.430537 | 199.118032 | 202.393738 | 5682800 | 0.0 | 0.0 |
| 2020-01-07 00:00:00-05:00 | 201.970488 | 202.945833 | 199.578122 | 201.068756 | 5685400 | 0.0 | 0.0 |
| 2020-01-08 00:00:00-05:00 | 201.326415 | 205.163393 | 201.206792 | 204.077621 | 4916200 | 0.0 | 0.0 |